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A novel code for the approximate computation of long-range forces be- 
tween N mutually interacting bodies is presented. The code is based on 
a hierarchical tree of cubic cells and features mutual cell-cell interactions 
which are calculated via a Cartesian Taylor expansion in a symmetric way, 
such that total momentum is conserved. The code benefits from an im- 
proved and simple multipole acceptance criterion that reduces the force 
error and the computational effort. For N > 10 4 , the computational costs 
are found empirically to rise sub-linear with N. For applications in stellar 
dynamics, this is the first competitive code with complexity O(N), it is 
faster than the standard tree code by a factor ten or more. 
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1. INTRODUCTION 

In iV-body simulations of stellar dynamics (or any other dynamics incorporating 
long-range forces), the computation, at every time step, of the gravitational forces 
between N mutually interacting bodies dominates the operational effort. In many 
situations, such as studies of collisionless stellar systems, the error of these simula- 
tions is dominated by the noise in the distribution of the bodies, whose number N is 
just a numerical parameter. Therefore, instead of computing the forces exactly by 
direct summation over all pairs of bodies, one may use approximate but much faster 
methods, allowing substantially larger TV and hence significantly reduced noise. 

In stellar dynamics, one of the most commonly used approximate methods is the 
Barnes & Hut tree code [1] and its clones. With these methods, one first sorts the 
bodies into a hierarchical tree of cubic cells and pre-computes multipole moments 
of each cell. Next, the force at any body's position and generated by the contents 
of some cell is calculated by a multipole expansion if the cell is well-separated 
from the body; otherwise the forces generated by the cell's child nodes are taken. 
This technique reduces the number of interactions per body to C(log N) and hence 
results in an overall complexity of Q(N log N). 

Another technique used frequently, for instance in molecular dynamics, is Green- 
gard & Rokhlin's [2, 3] fast multipole method (FMM) and its variants. Tradition- 
ally, these methods first sort the bodies into a hierarchy of nested grids, pre-compute 
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multipole moments of each cell, and then compute the forces between grid cells by 
a multipole expansion, usually in spherical harmonics. That is, cells are not only 
sources but also sinks, which formally reduces the complexity to O(N), although 
the author is not aware of an empirical demonstration in three dimensions 1 . 

Currently, no useful implementation of FMM for application in stellar dynamics 
exists. In fact, it has been shown [5] that, for this purpose, the FMM in its tradi- 
tional form cannot compete with the tree code. The reason, presumably, is two-fold: 
first, because stellar systems are very inhomogeneous, non-adaptive methods are 
less useful; second FMM codes are traditionally designed for high accuracy, whereas 
in collisionless stellar dynamics a relative force error of few 10~ 3 is often sufficient. 

Here, we describe in detail a new code, designed for application in the low- 
accuracy regime, that combines the tree code and FMM whereby taking the better 
of each. In order to be fully adaptive, we use a hierarchical tree of cubic cells. The 
force is calculated employing mutual cell-cell interactions, in which both cells are 
source and sink simultaneously. Whether a given cell-cell interaction can be exe- 
cuted or must be split, is decided using an improved multipole-acceptance criterion 
(MAC). The new code is a further development of the code presented in [6], which 
in turn may be considered an improvement of a code given in [7] . It is substantially 
faster than the tree code and empirically shows a complexity of O(N) or even less. 

The paper is organized as follows. In §2, the numerical concepts are presented; 
§3 describes the algorithm; in §4 the force errors are empirically assessed for some 
typical stellar dynamical test cases; §5 presents empirical measurements of the 
complexity and performance, also in comparison to other methods; finally, §6 sums 
up and concludes. 

2. APPROXIMATING GRAVITY 

The goal is to approximately compute the gravitational potential $ and its deriva- 
tive, the acceleration, at all body positions Xi and generated by all other A — 1 
bodies 

Hx i ) = -J2^j9(xi-x j ), (1) 

where \ii is the weight of the zth body. Consider two cells A and B, see Figure 1, 
with centers of mass za and Zb, respectively. The Greens function describing the 
mutual interaction between a body at position x in cell A and a body at position 
y in cell B may be Taylor expanded about the separation R = za — zb ■ 

g(x - y) = £ 1 (x - y - fl)W V^g(R) + K p (g), (2) 

n=0 

where p is the order of the expansion, while 1Z P denotes the Taylor series remainder 
(see also Appendix A). Here, we follow Warren & Salmon [7] by using the notational 
shorthand in which x^ indicates the ra-fold outer product of the vector x with 



1 One of the most recent members of the FMM family, presented by Cheng, Greengard & Rokhlin 
[4], does still not show clear O(N) behavior at N = 10 6 (see, e.g., table I of their paper). 
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FIG. 1. Two interacting cells (boxes). Bodies are shown as solid dots. The stars indicate 
the positions of the centers of mass, the solid and dotted circles around of which have radii r max 
and r cr i t , respectively, for 9 = 0.5 



itself, while denotes a tensor inner product. When inserting (2) into (1), whereby 
restricting the sum over j to all bodies within cell B, we obtain for the potential at 
every position x in cell A and generated by all bodies in cell B [7] 

®b^a(x) = - E ^ (* - *A) (m) C^'Za + ^p(*b^a) (3) 

m=0 

C B^A - E V (n+ " l) 5( J R) © M B, (4) 

n=0 

MS = E My*-ZB) (n) . (5) 

y s eB 

The summation over m in equation (3) represents the evaluation of gravity, rep- 
resented by the field tensors Cg_^ A , at the evaluation point x within the sink cell 
A. The computation of the field tensors via the summation over n in equation (4) 
represents the interaction between sink cell A and source cell B, represented by its 
multipole moments Mg. 

The symmetry between x and y of the Taylor expansion (2) has the impor- 
tant consequence that, if equations (3) and (4) are used to compute V$b^a and 
V<J>a^b, Newton's third law is satisfied by construction. For instance, the sum 
over all forces of N bodies vanishes within floating point accuracy. 

Note that the highest-order multipole moments, M n_p , contribute only to the 
coefficients C 0,p , and hence affect only $ but not V<1>. Since, in stellar dynamics 
the acceleration rather than the potential is to be computed, one may well ignore 
M p , reducing CPU-time and memory requirements. 

The formulae used in the standard tree code can be obtained by setting x = za, 
corresponding to body sinks. In this case, potential and acceleration are approxi- 
mated by —Cq^ a and Cg P _ >A , respectively. 

2.1. Gravity Between Well-Separated Cells 

In our implementation, we stick to a third order expansion (p = 3), whereby ignor- 
ing octopoles M B (see remark above). The dipole Mg vanishes by construction and 
the Taylor-series coefficients for spherical Greens functions read (with Einstein's 
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sum convention) 



C| .a.,, = M,, ,i ; /)- • /,>,/,',/;-';. 

C|^ A , ijfc = m b [{SijRk + 8 jk R t + 5 ki Rj)D 2 + RiRjR k D 3 ] , 

where M B = M B is the mass of cell B and M| = M B /M B its specific quadrupole 
moment, while 



(7) 

r=|«l 



In practice, we calculate the coefficients C B _^ A = M A C B _^ A , because these obey 
the symmetry relations C|^ A = C A ^ B and C B ^ A = — C A ^ B (for p = 3), which 
arise from the mutuality of gravity. Since we consider always both directions of any 
interaction, exploiting these relations substantially reduces the operational effort. 



2.2. Accumulating Taylor Coefficients 

After, for each cell, the coefficients of all its interactions have been accumulated 
as C A = J2b *-bUa' wnere the sum includes all interaction partners B of cell A, 
we transform back to C A = C A /M A . Next, for each body, the Taylor series of 
all relevant cells (those that contain the body), have to be accumulated by first 
translating to a common expansion center and then adding coefficients. 

In contrast to expansions in spherical harmonics, the translation of the Cartesian 
expansion (3) to a different center z is straightforward. Let C™' p be the coefficients 
for expansion center Zq, then the coefficients for expansion center z\ are 

p-m 

c r p = E A (*o - zi) {n) © (8) 

n=0 



2.3. The Multipole Acceptance Criterion (MAC) 

For the expansion (3) to converge, we must have \x — y — R\ < \R\ for all body-body 
interactions 'caught' by a single cell-cell (or cell-body) interaction (see Appendix A). 
In order to ensure this, we first obtain, for each node, an upper limit r max for the 
distance of any body within the node from the center of mass (bodies naturally 
have r max = 0). We take r max to be either the distance from the cell's center of 
mass to its most distant corner [8], or [9] 

max {ri, max + |z - z,|}, (9) 

child nodes i 

whatever is smaller. Then \x — y — R\ < 8\R\ V x £ A and y <G B, i.e. the nodes 
are well-separated, if 



|-R| > ^A.crit + 7-B,crit with r crit = r max /6». 



(10) 
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The tolerance parameter 6 controls the accuracy of the approximation: for Newto- 
nian forces, the error made in d dimensions by the pth order expansion of the form 
(3) is (equation (A. 7)) 

0P+ 2 d _ 2 0P+ 2 {d-2)/d (u \ 

(1 - 6) 2 B < max (1 - 6) 2 B [ ' 

where M B cx r B max has been assumed. Since M B /i? 2 is, to lowest order, the ac- 
celeration due to the interaction, equation (11) tells us that with constant 9, which 
is standard tree-code practice, the relative error introduced by every single inter- 
action is approximately constant. Equation (12), however, shows that for the most 
important case of d = 3 this practice results in larger absolute errors for interac- 
tions with bigger and hence on average more massive cells. This implies that these 
interactions will dominate the total error of any body's acceleration. It is, therfore, 
expedient to balance the absolute errors of the individual interactions, which can be 
approximately achieved by using a mass-dependent tolerance parameter as follows 

np+2 dP+2 / M \(2-d)/d 

' ' (13) 



(i-0) 2 (i-e min ) 2 \M 



'tot 



where M tot is the mass of the root cell, while m ; n = 0(M tot ) is the new tolerance 
parameter. If rA,max ~ ?"B,max and M A <~ M B , this method results in approximately 
constant absolute acceleration errors. Note that equation (13) results in a very weak 
decrease of 9 with increasing mass: 6 cx M -1 / 15 for d = 3, p = 3 and 8 <C 1. 

Instead of equation (12), one can obtain a stricter error limit incorporating the 
first p+1 multipole moments (see Appendix A), which may be turned into an MAC 
[7]. However, our choice (13) is (i) much simpler and (ii) overcomes already the 
main disadvantage of 9 — const, the variations of the absolute individual errors. 

2.4. Direct Summation 

For small N the exact force computation via direct summation is not only more ac- 
curate than approximate methods but also more efficient. Therefore, we replace the 
approximate technique by direct summation whenever the latter results in higher 
accuracy at the same efficiency, see Appendix B for more details. 

If an interaction is executed by direct summation, the Taylor coefficients of the 
interacting cell(s) are not affected, but the coefficients C° and C 1 of the bodies 
within the cell(s) are accumulated. 

3. THE ALGORITHM 

3.1. Tree Building and Preparation 

In the first stage, a hierarchical tree of cubic cells is build, as described in [1], albeit 
cells containing s or less bodies are not divided. Next, the tree is linked such that 
every cell holds the number of cell children 2 , a pointer to its first cell child, as 



Hereafter 'child' means a direct sub-node of a cell, while 'descendant' refers to any node 
contained within a cell, including the children, the grand-children and so on. 
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well as the number of body children, of all body descendants, and a pointer to the 
first body child. Since cell as well as body children are contiguous in memory (to 
arrange this we actually use tiny copies of the bodies, called 'souls', that only hold 
a pointer to their body and its mass, position, acceleration and potential) these 
data allow fast and easy access to all cell and body children of any given cell and 
to all bodies contained in it. 

Next, the masses M , centers of mass z, radii r max and r cr it, and specific quadrupole 
moments M 2 of each cell are computed in a recursive way from the properties of 
the child nodes. 

3.2. A new Generic Tree- Walk Algorithm 

One important new feature of our code is the mutual treatment of all interactions: 
both interacting nodes are source and sink simultaneously. The standard tree- walk, 
as for instance implemented by the generic algorithm given in [7], as well as the 
the usual FMM coefficient accumulation algorithm, e.g. in [4], contain an inherent 
asymmetry between sinks and sources, and thus cannot be used for our purposes. 
Instead, our algorithm approximates the forces in two steps: an interaction phase, 
incorporating equation (4), and an evaluation phase, incorporating equation (3). 

3.2.1. The Interaction Phase 
Because of the mutuality of the interactions, we cannot accumulate the Taylor 
coefficients C" 'on the walk', but each node must accumulate the coefficients of all 
its interactions in its own private memory. Cells need storage for C° to C 3 , while 
bodies only need to accumulate C° and C , i.e. potential and acceleration. The 
accumulation of these coefficients is done by the following algorithm with the root 
cell for arguments A and B. 

Algorithm I (Interact(node A, node B)). 

try to perform the interaction between nodes A and B; // see Appendix B 
if(it cannot be performed) 

if (A = B) // split cell self-interaction 

for(all pairs {a, b} of child nodes of A) 
Interact(o, b); 

else if (r max (A) > r max (B)) // split bigger node 

for(all child nodes a of A) 

lNTERACT(a, B); 

else 

for(all child nodes b of B) 
Interact^, 6); 

Thus, if an interaction cannot be executed, using the formulae of the last section 
- see Appendix B for details - it is split. In case of a mutual interaction, the bigger 
node is divided and up to eight new mutual interactions are created, while a self 
interaction of a cell results in up to 36 new interactions between its child nodes. In 
practice, we use a non-recursive code incorporating a stack of interactions. 



3.2.2. The Evaluation Phase 
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Finally, the Taylor coefficients relevant for each body are accumulated and the 
expansion is evaluated at every body's position. After transforming C to C for 
every cell and body, this is done by the following recursive algorithm, which is 
initially called with the root cell and an empty Taylor series as arguments. 

Algorithm 2 (Evaluate Gravity(cell A, Taylor series T )). 
Ta = Taylor series due to the C™ of cell A] 

translate center of T to center of mass of A] // using equation (8) 

Ta += To; // add up coefficients 

for(all body children of A) { 

evaluate Ta at body's position; // as in equation (3) 

add to body's potential and acceleration; 

} 

for (all cell children C of A) 

Evaluate Gravity(C, Ta); //recursive call 

Thus, the coefficients of the Taylor-series that is eventually evaluated at some 
body's position have been added up from all hierarchies of the tree and hence 
account for all interactions of all cells that contain the body. 

4. ERROR ASSESSMENT 

Two types of force errors are involved in collisionless A-body simulations of stellar 
dynamics. One is the unavoidable error between the smooth force field of the under- 
lying stellar system modeled and the forces estimated from the positions of N bodies 
(which are sampled from this stellar system) . This estimation error can be reduced 
by increasing the number N of bodies in conjunction with a careful softening: the 
Newtonian Greens function g(x) = G/\x\ is replaced with a non-singular function 
that approaches the Newtonian form for \x\ larger than the softening length e. But 
at fixed A, it cannot be decreased below a certain optimum value [10, 11]. 

The other type of error is introduced by an approximate rather than exact com- 
putation of these estimated forces. While this approximation error can be reduced 
to (almost) any size (at the price of increasing computational effort), it is sufficient 
to reduce it well below the level of the estimation error. We will now first assess 
the approximation error alone and then consider the total error. 

4.1. Accuracy of the Approximation 

We estimate the accuracy of the approximated forces for various choices of the 
parameters controlling the algorithm and for two typical astrophysical situations: 
a spherical Hernquist [12] model galaxy, which has density and force per unit mass 



Mtot r . x GM 



2?r |aj|(r + |aj|) \ x \ (r + \x\) 

and a group of five such galaxies at different positions and with various scale radii 
r . In cither case, we sample N = 10 4 , 10 5 , and 10 6 bodies. We use standard 
Plummcr softening, where g(r) = G/\/ r 2 + e 2 , with softening lengths e chosen such 
as to minimize the estimation error [11]. In order to single out the approximation 
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galaxy galaxy group 




0.001 0.01 0.1 0.001 0.01 0.1 

£ £ 
FIG. 2. The mean (dashed) and 99 percentile (solid) relative force error versus the CPU 
time consumed on a PC (Pentium III/933MHz/Linux) for approximating the forces of a galaxy 
(left) and a group of galaxies (right), sampled with a total of 10 4 (top), 10 5 (middle), or 10 6 
(bottom) bodies. We used constant (open triangles) or mass-dependent tolerance parameter (solid 
squares). The symbols along each curve correspond, from left to right, to 0,# m i n =0.3, 0.4, 0.5, 
0.6, 0.7, 0.8, and 0.9. Cells containing s or less bodies have not been divided. 



error, we compare with the forces obtained by a computation via direct summation 
(in double precision; in case of N = 10 6 for the first 10 5 bodies only). As measure 
for the relative error, we compute for each body [5] 

£ = |&approx ^direct | /^direct j (-^) 

where a denotes the magnitude of the acceleration. Figure 2 plots the mean rel- 
ative error, e, and that at the 99th percentile, £99%, versus the CPU time needed 
by an ordinary PC (Pentium III/933MHz/Linux/compiler: gec version 2.95.2) for 
both constant and mass-dependent tolerance parameter. This figure allows several 
interesting observations. 

1. For the same = const and stellar system, the errors decrease with increasing 
N. This is because, at constant relative force error per individual interaction (as is 
the case for 9 = const), the total error of some body's force scales with the inverse 
square root of the number of individual interactions contributing, which increases 
with N. 
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2. At the same operational effort, as measured by the CPU time, the mass- 
dependent tolerance parameter employing equation (13) results in smaller errors 
than 9 = const, for N > 10 4 . This advantage becomes more pronounced for larger 
N, because, with 9 = const, the absolute force errors of individual interactions 
contributing to some body's force vary stronger with increasing N (due to the 
larger range of cell masses), such that balancing them becomes more beneficial. 

3. Finally, a relative error of £99% of a few per cent or £ of a few 10 -3 at N = 10 5 , 
which is commonly accepted to be sufficient in astrophysical contexts [5], requires 
a tolerance parameter 9 = const ~ 0.65 or # min ~ 0.5. 

4.2. The Total Force Error 

The important question here is for which choices of 9 is the approximation error 
negligible compared to the estimation error? To answer this question, we have 
performed some experiments using samples of N = 10 4 , 10 5 and 10 6 bodies drawn 
from a Hernquist model and computed the mean averaged squared error (MASE) 
of the force (per unit mass) [13]: 

MASE(F) = I J- ]T ft (F t F{ Xi )) 2 j . (16) 

Here, F j and F(x) are, respectively, the approximately estimated force for the 
ith body and the true force field of the stellar system. The curly brackets denote 
the ensemble average over many possible random realizations of the same under- 
lying stellar density by N bodies. We used 10 7 /N ensembles and computed the 
MASE(F') 3 for various values of 9, but always at optimum e [11]. As one might 
already have guessed from the behavior of the approximation error in Figure 2, 
the relative increase of the MASE(F) is negligible: even for 9 = 0.9, the approx- 
imation error contributes less than one per cent, in agreement with the findings 
of [13]. Based on this result, one may advocate the usage of tolerance parameters 
larger than 6* m ; n ~ 0.5. However, the distribution of approximation errors is not 
normal, and the rms error, which is essentially measured by (the square root of) 
the MASE(F'), may well underestimate the danger of using large 9. We therefore 
cannot recommend using 9 mm <; 0.7. 

5. PERFORMANCE TESTS 

5.1. Scaling with N 

We measured the CPU time consumption for both constant 9 — 0.65 and 9 = 9(M) 
with 6> m in = 0.5, using s — 6 in either case. Figure 3 plots the consumed time 
(averaged over many experiments) per body versus N plotted on a logarithmic 
scale. For the case of 9 = 6>(M), Table 1 gives the number of cells as well as 
the number of individual body-body (B-B), cell-body (C-B) and cell-cell (C-C) 
interactions, where the latter two are split into those done via a Taylor expansion 
and direct summation (subscripts 'app' and 'dir'), respectively. 



3 The force field of the Hernquist model has a central singularity, causing a 100% force error 
at r = 0, which cannot be resolved by ./V-body methods. In our experiments, we have therefore 
restricted the summation in equation (16) to |a3j| > e/2. 



10 



W. DEHNEN 




TABLE 1 

Number of Cells and of Interactions for m i n = 0.5, s = 6 and 







the 


same test 


case as in 


Figure 3 






N 




B-B 


C-B a pp 


C-B dir 


C-Capp 


C-Cdir 


total 


1000 


370 


210 


3746 


2049 


4057 


1600 


11662 


3000 


1090 


472 


16870 


5184 


30101 


4472 


57099 


10000 


3558 


753 


61873 


11914 


163554 


11887 


249981 


30000 


10607 


494 


170634 


24984 


572322 


20102 


790474 


100000 


35282 


112 


429035 


73350 


1954508 


53466 


2516146 


300000 


106065 


66 


1041836 


205821 


5457445 


137138 


6859114 


1000000 


353342 


1 


2918326 


621802 


16105065 


393311 


19023391 


3000000 


1060650 


2 


7771584 


1689115 


42974890 


1047627 


53645379 



The tree code requires O(NlogN) operations, corresponding to a rising straight 
line in Fig. 3. For our code, however, there is a turn-over at N ~ 10 , above which 
the CPU time per body approaches a constant 4 (for 9 = const) or even decreases 
with N (for 9 = 9(M)), i.e. the total number of operations becomes O(N) or less, 
which is also evident from the number of interactions in Table 1. 

In order to understand these scalings, let us first consider the simpler case of 
9 = const and a homogeneous distribution of bodies. Then, eight-folding N is 
equivalent to arranging eight copies of the old root cell into the octants of the new 
root cell [1], and the total number of interactions rises from Nj to 8Ni + N + , where 
N + interactions are needed for the mutual forces between these octants. In terms 
of a differential equation, this gives 

dNj ^ Nj AlnN! ^ Ni_ N + 

dN ~ N AlnN ~ N 7V81n8' [ ' 



4 The slow rise can be entirely attributed to the tree-building, which is an 0(N log N) process. 



A HIERARCHICAL 0(N) FORCE CALCULATION ALGORITHM 



11 




100 1000 10 4 10 5 10 6 10 7 

JV 

FIG. 4. CPU time per body versus N for various techniques. The direct, tree and our 
code were all compiled and run with the same equipment. For the new code, we use 6 = 0.65 and 
fmin = 0.5, while for the tree code 9 = 0.8 and p = 3. The scalings for the GRAPE-5 methods 
are taken from [14]. The accuracy requirements for the approximate (non-direct) methods are 
adapted for astrophysical applications. 



where the first term on the right-hand side accounts for the intra-domain and the 
second for the inter-domain interactions. Equation (17) has solution 

N ^ CoN+ ^rsJw dN - (18) 

In the tree code, every body requires a constant number of additional interactions, 
i.e. N + oc N, and the second term in (18) becomes cx iVlniV dominating Ni(N). 
However, in the new code, N + grows sub-linear for large N, since a constant number 
of cell-cell interactions accounts for most new interactions of all bodies. In this case, 
the second term on the right-hand side of equation (18) also grows sub-linear with 
N and Ni will eventually be dominated by the first term. That is, in contrast to 
the tree code, the inter-domain interactions are neglible at large N when compared 
to the intra-domain interactions. The transition value of N will depend on the 
tolerance parameter 9 and the distribution of bodies. 

Another way of estimating the scaling of the computational costs with TV is 
similar to the FMM approach [2, 4]: On each level / of the tree there are ~ 8' cells 
of mass M ~ 8~', i.e. n(M) oc M~ 2 . Each cell has oc 9~ 3 interactions, and thus 

f dM 

"Ww" (19) 

Hence, for 9 = const, Nj oc 1/M m j n oc N, while a shallower scaling results if 9(M) 
increases towards smaller masses. Empirically we find for N > 30000 that the CPU 
time used by the force computation alone (without tree building) is very well fit by 
the power-law oc jV°' 929±0 ' 001 . 



5.2. Comparison with Other Methods used in Stellar Dynamics 

For various computational techniques used in stellar dynamics, Figure 4 plots N 
versus the CPU time consumption normalized by N, both on logarithmic scales. 
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TABLE 2 

Comparison with FMM: Timing Results for Bodies Uniformly Distributed 

in a Cube 



N 


1 FMM 


direct 






rpb 

approx 


rpc 

■*■ direct 


E» 




20000 


13.3 


233 


7.9 ■ 10" 


-4 


0.97 


136 


3.7-10" 


-4 


50000 


27.7 


1483 


5.2 • 10" 


4 


2.64 


924 


3.3 • 10" 


-4 


200000 


158 


24330 


8.4 • 10" 


4 


10.77 


14694 


3.4-10" 


-4 


500000 


268 


138380 


7.0 • 10" 


4 


29.42 


91134 


3.7-10" 


-4 


1000000 


655 


563900 


7.1 • 10" 


4 


58.34 


366218 


3.5 • 10" 


-4 



a Using FMM; data from Table I of Cheng ct al. [4] 

b Using the code presented here on a computer identical to that used by Cheng et al. 
c Using our own implementation of direct summation on the same computer 



An ordinary direct method running on a general-purpose computer is slower than 
our code for any N ;> 100. The GRAPE-5 system [14] obtains a ~ 100 times higher 
performance by wiring elementary gravity into special-purpose hardware and, for 
N <; 10 4 , is faster than any other method. 

The new code presented here is the fastest method running solely on general- 
purpose computers and the only one faster than 0(N log N). In particular, it out- 
performs the popular tree code by a factor of 10 and more. At 30000 <; N <; 3 x 10 7 , 
the only technique that requires less CPU time is a combination of the tree code with 
a GRAPE-5 board [14, 15]. Here, the speed-up due to the usage of special-purpose 
hardware does not quite reach that for direct methods, because of tree-building and 
other overheads that cannot be done on the GRAPE board. 



5.3. Comparison with Fast Multipole Methods 

Because our code relies on cell-cell interactions, it may be considered a variant of 
FMM, introduced by Greengard & Rokhlin [2, 3]. However, it differs in several 
ways from most implementations of FMM. (i) the expansions are centered on the 
cells' centers of mass instead of the geometrical centers; (ii) a Cartesian Taylor 
series is used instead of an expansion in spherical harmonics; (iii) the interaction 
partners are determined by a multipole acceptance criterion rather than by their 
mutual grid position; (iv) the mutuality of interactions is fully exploited; (v) the 
expansion order p is fixed and the accuracy controlled by the parameter 9. 

To assess the effect of these differences, we compared our code directly with the 
3D adaptive FMM code by Cheng, Greengard & Rokhlin [4] . We performed a test 
identical to one reported by these authors (N bodies randomly distributed in a 
cube) on an identical computer (a Sun UltraSPARC with 167MHz) using the same 
error measure (eq. (57) of [4]): 



E : 



E(**-*«) a /E** s 



1 1/2 



(20) 



where $ and $ are the potential computed by direct summation and the approxi- 
mate method, respectively (unfortunately, Cheng et al. do not give the more rele- 
vant error of the accelerations). Table 2 gives, in the last three columns, the CPU 
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time (Tappj-ox) in seconds and error E for our code with 9 = 1 and s = 6 as well as 
the time needed for direct summation in 64bit precision (Tdh-ect); columns 2-4 report 
the data from Table I of [4] . On average our code is faster by more than a factor of 
ten and twice as accurate (even though we compromised the approximation of the 
potential by omitting the octopole contributions). 

What causes this enormous difference? Since for the direct summation the tim- 
ings are much more similar, we can rule out differences in hardware, compiler, etc. 
as cause. An important clue is the fact that our code cannot compete with the 
Cheng et al. FMM in the regime E <^ 10~ 6 . While our code tries to reach this 
goal by decreasing 9, FMM obtains it by increasing p. In general, the accuracy 
as well as the performance are controlled by both the order p and the choice of 
interaction partners, parameterized in our code by 9. Hence, maximal efficiency at 
given accuracy is obtained at a unique choice of (p, 6). Apparently, low orders p 
are optimal for low accuracies, while high accuracies are most efficiently obtained 
with high orders, instead of an increasing number of interactions (decreasing 9). 

Thus, traditional FMM is less useful in the low-accuracy regime, such as needed 
in stellar dynamics, in agreement with earlier findings [5], and our code may be 
called a variant of FMM optimized for low-accuracy. Clearly, however, a code for 
which p and 9 can be adapted simultaneously would be superior to both. 

6. DISCUSSION AND SUMMARY 

Our code for the approximate computation of mutual long-range forces between 
N bodies extends the traditional Barnes & Hut [1] tree code by including cell-cell 
interactions, similar to fast multipolc methods (FMM). However, unlike most im- 
plementations of FMM, our code is optimized for comparably low- accuracy, which 
is sufficient in stellar dynamical applications (§5.3). 

As a unique feature, our code exploits the mutual character of gravity: both 
nodes of any interaction are sink and source simultaneously This results in exact 
conservation of Newton's third law and substantially reduces the computational 
effort, but requires a novel tree- walking algorithm (§3.2) which preserves the natural 
symmetry of each interaction. Note that the generic algorithm given in §3.2 is not 
restricted to long-range force approximations, but can be used for any task that 
incorporates mutuality, for instance neighbor and collision-partner searching. 

6.1. Complexity 

The new code requires O(N) or less operations (Fig. 3) for the approximate compu- 
tation of the forces of N mutually gravitating bodies. For stellar dynamicists, this 
is the first competitive code better than 0(N log N). A complexity of O(N) was 
expected for methods based on cell sinks (implying cell-cell interactions) [2, 4, 7], 
but, to my knowledge, hardly ever shown empirically in three dimensions. 

Our code obtains a complexity of less than O(N) by employing a mass-dependent 
tolerance parameter 9. The traditional 9 = constant results in equal relative errors 
of each interaction, such that the total error of any body's force is dominated by 
the interactions with the most massive cells. By slightly increasing the tolerance 
parameter for less massive cells, we obtain (approximately) equal absolute errors, 
resulting in a lower total force error than the traditional method for the same 
number of interactions. Thus, at the same error, we require less interactions. The 
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additional interactions, arising when increasing N, occur at ever less massive cells 
and hence at ever larger tolerance parameters. This causes the computational costs 
to rise sub-linear with N > 10 4 (depending on the accuracy requirements). 

6.2. Performance 

We have shown that on general-purpose computers our code out-performs any com- 
petitor code commonly used in the field of stellar dynamics. A recent adaptive 3D 
implementation of FMM [4] is also out-performed by a factor of ten (§5.3), which 
is related to the fact that traditional FMM codes appear to be good only in the 
high-accuracy regime. The code presented here was optimized for low accuracies, 
but by increasing the expansion order p, one can easily obtain a version suitable 
for high accuracies, and it remains to be seen how it would perform compared to 
traditional FMM. 

Currently, the only faster method appears to be a GRAPE-supported tree code 
[15], which uses the special-purpose GRAPE hardware [14]. Unfortunately, unlike 
the tree code, our code cannot be combined with the current GRAPE hardware. 
There are, however, no conceptual obstacles against hard-wiring equations (6) into 
special-purpose hardware, which should yield a speed-up comparable to that of tree 
to GRAPE tree, i.e. a factor ~ 50. 



6.3. Publication of the Code 

Our code is written in C++, also includes a purely two-dimensional version (not 
described here), and is linkable to C and FORTRAN programs. The code will be 
publicly available from the author upon request. 

A full Af-body code based on this force algorithm is available under the NEMO 
[16] package (http://bima.astro.umd.edu/nemo). 

APPENDIX A 

Here, we give an error estimate for the Taylor series approximation of gravity. 
Using the integral form of the Taylor series remainder, we find for the remainder 
in equation (2), with A = (x — z\) — (y — z-q) = x — y — R. 

a (p+i) /-i , 

ttp(fl) = —° / dt(l-t)PV {p+1) g(R + At) 

Jo 



p\ 



For Newtonian gravity, 



1 f 1 

- / dt(l-t)PV {p+1) g(R + At) 
V- Jo 

J dt (1 - ty- 1 V^g(R + At) 



(P 



For the summation over the source cell, we find 



< 



< 



(R - |A|) Rp +1 ' 

(p+l)R- P \A\ 
(R- \A\) 2 Rp+ 1 ' 



(A.l) 
(A.2) 

(A.3) 
(A.4) 



£ wA^I <E0ri, maJC ||M(f- fe )|| < (r A , max + r B , m ax) p M B (A.5) 

ViEB k=0 ^ ' 
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with M B the mass of cell B. With 6 > (rA, max + ?"B,max)/-Ri we finally get 

|^ p (<J>b^a)| < Y~6~R (A - 6) 

17? M < (P+ 1 )° P M B (A 7) 

\1l p {V <f> B ^A)\ < ^ _ Q y ( A ") 



APPENDIX B 

Here, we give the interaction details for Algorithm 1 in §3.2.1. Mutual body- 
body interactions are done by elementary gravity, while body self-interactions are 
ignored. Mutual interactions between nodes containing N\ and -/V 2 bodies are 
treated as follows. 

1. \{ N1N2 < N^ c , execute the interaction by direct summation; otherwise, 

2. if the interaction is well-separated, execute it using equations (6); otherwise, 

3. if A1A2 < NP° st , execute the interaction by direct summation; otherwise, 

4. the interaction cannot be executed, but must be split. 

Here, N^ c and AJP° st have different values depending whether it is a cell-body 
(cb) or cell-cell (cc) interaction (see below). Cell self- interactions are done slightly 
differently: 

1. If Ai < N CSl execute the interaction by direct summation; otherwise, 

2. the interaction cannot be executed, but must be split. 

The numbers iV p b ro , 7V p b ost , N£ c , NP° st , and N cs determine the usage of direct 
summation. After some experiments, I found the following values to result in most 
efficient code at given accuracy (but this certainly depends on the implementation 
details). 

_/V p , rc = 3 iV pro = 

Jv cb °i iv cc u i fr> 1 \ 

7V c p b ost = 128, A^P° st = 16, N cs = 64. V ' > 

Note that cell-body interactions with as many as 128 bodies in the cell hardly ever 
occur, since the interaction algorithm favors interactions between roughly equally 
sized nodes. Thus, cell-body interactions will almost always be executed. 
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